ABSTRACT


Deployment of a distributed deep learning technology stack on a large parallel system is a very
complex process, involving the integration and configuration of several layers of both
general-purpose and custom software. The details of such deployments are rarely described in the
literature. This paper presents the experience gained during the deployment of a technology stack
to enable deep learning workloads on MareNostrum, a petascale supercomputer. The components of a
layered architecture, based on the usage of Apache Spark, are described, and the performance and
scalability of the resulting system are evaluated. This is followed by a discussion of the impact
of different configurations, including parallelism, storage and networking alternatives, and other
aspects related to the execution of deep learning workloads on a traditional HPC setup. The derived
conclusions should be useful to guide similarly complex deployments in the future.


1. Introduction 


Over the past several years, deep neural networks (DNNs) have proven to be an incredibly
effective tool for a variety of problems, from computer vision and speech recognition to natural
language processing. Their number of parameters, their complexity and the size of the training
datasets have grown quickly, making DNN training a first-class workload for HPC (High-Performance
Computing) infrastructures. However, enabling deep learning workloads on a large parallel system
is a very complex process, involving the integration and configuration of several layers of both
general-purpose and custom software. The details of such deployments are rarely described in the
literature. This paper presents the experience gained during the deployment of a technology stack
to enable deep learning workloads on a real-world, petascale HPC setup, the MareNostrum
supercomputer.

The goal of the deployment is to take advantage of the computational resources provided by
MareNostrum (almost 50K cores and more than 100 TB of aggregated RAM) for training DNNs. Nowadays,
GPUs have proven to be the most efficient alternative for training neural networks, speeding up
common operations such as large matrix computations (Lee et al. (2010); Fujimoto (2008)). As
their price, performance and energy efficiency improve, GPUs are gaining ground in HPC (both in
special-purpose systems and in hybrid general-purpose supercomputers). However, there are still
many systems, such as MareNostrum, that are not equipped with GPUs.

The key element of the deployed layered architecture is Apache Spark (Zaharia et al.). Spark is
commonly used as an intermediate layer to isolate machine learning applications from the
particularities of the underlying system (not only on MareNostrum; Wang et al. (2014) do the same
on a Cray X-series supercomputer). The deployment of Spark-enabled clusters over MareNostrum is
not trivial: it has required the development of a specific interoperability layer that we call
Spark4MN, which will be explained later. On top of this stack (MareNostrum, Spark4MN and Spark)
we place a deep-learning-specific layer, DL4J. DL4J, which is written in Java and integrates
directly with Spark, enables distributed training of deep neural networks through a synchronous
data parallelism method.

These four elements (DL4J, Spark, Spark4MN and MareNostrum) have been integrated to enable
efficient training of deep neural networks. Apart from the deployment details, the challenge is
scalability and proper configuration. Simply running on many cores may yield poor benefits or
even degraded performance due to overheads. We address this issue and aim to take a first step
towards a systematic analysis of the relevant parameters and their optimized configuration.

In order to evaluate the performance and scalability of the proposed software stack on
MareNostrum, we have experimented with different workloads and different deployment setups
(number of nodes, parallelism configuration, etc.). In the following sections we explain the
different components of the deployment in more detail. Then, we discuss the experiments
performed and the results obtained, aiming to shed light on the parameters that have the biggest
impact and on their effective configuration. We provide insights into how the job configuration
on a traditional HPC setup can be optimized to run this kind of workload efficiently. The derived
conclusions should be useful to guide similarly complex deployments in the future.


2. Related Work 


Several works have addressed the execution of deep learning workloads on large special-purpose
clusters, usually involving nodes equipped with GPUs. In Kurth et al. (2017), the authors present
a Caffe-based approach to execute deep learning workloads on a contemporary HPC system equipped
with Xeon Phi nodes. They use the Intel distribution of Caffe, which improves Caffe performance
when running on CPUs. The authors report that, thanks to a hybrid approach, they are able to
overcome the limitations of synchronous systems and scale the training of a model up to thousands
of nodes. In You et al. (2017), the authors describe another method (tested on KNL clusters and
multi-GPU clusters) with very good weak scaling efficiency (e.g. 92% for GoogLeNet on 2176 cores
with respect to an Intel Caffe baseline). Alternatively, distributed DNN training can be deployed
through an integrated software stack. Despite the potential performance limitations, taking
advantage of thousands of underutilized cores to alleviate the pressing demand for computational
resources to train deep learning models, with minimum cost and setup time, is an attractive
option for many general-purpose HPC infrastructures, especially if they already provide the lower
components of the stack. A common case is that of infrastructures with an Apache Spark
abstraction layer. Enabling distributed DNN training in these situations is straightforward
through the integration of a DL4J layer. While the performance of Spark on HPC setups has already
been studied in many works (e.g. Michael et al. (2014), Wang et al. (2014), Haut et al. (2017)),
as far as we know there are no previous works evaluating the feasibility and scalability of an
integrated Spark-DL4J solution applied to an HPC setup. One potential limitation of DL4J is that
it is currently constrained to synchronous SGD-based training. The work described in Keuper and
Pfreundt (2016) analyzes the main bottlenecks of the synchronous approach. The authors conclude
that it quickly turns into a largely communication-bound problem, which severely limits
scalability in most practical scenarios.


3. Deep Neural Networks 


Deep neural networks (DNNs) are layered compositional models that enable learning
representations of data with multiple levels of abstraction. State-of-the-art DNNs include many
variants, specialized in different domains (convolutional deep neural networks, recurrent neural
networks, etc.). DNNs are usually trained using iterative, gradient-based optimizers (typically
mini-batch SGD) that drive a non-convex cost function to a local minimum. In every iteration
step, we use information about the gradient $\nabla E$ at the current point. In iteration step
$[t+1]$, the weight update $\Delta w[t]$ is determined by taking a step ($\gamma$ is the learning
rate) in the direction of the negative gradient at position $w[t]$ such that (in the case of
stochastic training):

$$\Delta w[t] = -\gamma \, \frac{\partial E}{\partial w[t]} \qquad (1)$$

State-of-the-art networks have a huge number of weights W, and the core computation in their
training is dominated by dense linear algebra. Usually, in order to improve efficiency, the
training dataset is split into mini-batches of size B (typically chosen between 1 and a few
hundred) and the model is only updated (one iteration) after accumulating the gradients of all
the training samples within a mini-batch.

DNN training on a single node involves several software and hardware layers. At the top of the
stack there is normally a deep learning framework such as DL4J, TensorFlow, Torch, etc. (there
may even be an upper layer such as Keras). Below, the framework relies on an underlying numerical
library such as NVIDIA’s cuDNN or Intel’s MKL. Finally, the models are usually trained on NVIDIA
GPUs or Intel’s Xeon Phi processors.

When training on multiple nodes, one can apply data parallelism (distributing training samples
among nodes) and/or model parallelism (distributing model parameters among nodes). In our
deployment, we only apply data parallelism. The B training samples within a mini-batch are split
into n equally sized sets of size b (with b = B/n). The resulting mini-batch-splits are then fed
to n nodes, each holding a complete copy of the model. The results (gradients) of all nodes are
then accumulated and used to update the model.
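
The splitting and accumulation step described above can be sketched in a few lines of Python.
This is a minimal illustration, not DL4J code: the scalar linear model, the squared-error loss,
the sample values and the helper names (`split_mini_batch`, `data_parallel_step`) are assumptions
introduced here. It shows that, with n equally sized splits, averaging the per-node mean
gradients gives the same update as a single node processing the whole mini-batch.

```python
def sample_gradient(w, x, y):
    """Gradient of the loss 0.5 * (w * x - y) ** 2 with respect to the scalar weight w."""
    return (w * x - y) * x


def split_mini_batch(batch, n):
    """Split a mini-batch of B samples into n equally sized splits of size b = B / n."""
    if len(batch) % n != 0:
        raise ValueError("B must be divisible by n")
    b = len(batch) // n
    return [batch[i * b:(i + 1) * b] for i in range(n)]


def mean_gradient(w, samples):
    """Mean gradient over the samples held by one node."""
    return sum(sample_gradient(w, x, y) for x, y in samples) / len(samples)


def data_parallel_step(w, batch, n, learning_rate):
    """One synchronous update: every node uses the same copy of w."""
    splits = split_mini_batch(batch, n)
    node_gradients = [mean_gradient(w, split) for split in splits]
    accumulated = sum(node_gradients) / n  # accumulate the results of all nodes
    return w - learning_rate * accumulated


# Hypothetical mini-batch with B = 8 samples (x, y), roughly following y = 2x.
batch = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
         (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]

w_one_node = data_parallel_step(0.0, batch, n=1, learning_rate=0.01)
w_four_nodes = data_parallel_step(0.0, batch, n=4, learning_rate=0.01)
print(w_one_node, w_four_nodes)  # the two values agree up to rounding
```

The equivalence relies on the splits having equal size; with unequal splits the per-node means
would need to be weighted by the number of samples on each node.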

While DL4J limits us to performing this process synchronously (waiting for all the workers to
finish before updating the model), it could also be performed asynchronously (allowing model
updates with the results of only a subset of the nodes). Asynchronous data parallelism can
potentially achieve higher throughput but, depending on the state of the infrastructure, it can
suffer from the stale gradient problem: by the time a slow worker has finished its calculations
based on a given state of the model, the model may have been updated a number of times, and the
outdated update may have a negative impact. Some solutions to this problem (e.g. Nguyen et al.
(2018)) have recently been proposed.


4. DL4J 


DL4J (or Deeplearning4j) is a computing framework written for Java with wide support for deep
learning algorithms. DL4J is powered by its own numerical computing library, ND4J, and provides
distributed parallel versions (both for GPUs and CPUs) of the algorithms that integrate with
Apache Hadoop and Spark. Through a C++ native library, Libnd4j (with a BLAS backend), ND4J
provides intra-node parallelism for matrix operations (implemented with OpenMP vectorizable loops
with SIMD support). With the help of JavaCPP and JNI, pointers to off-heap memory (allocated
outside of the JVM and not managed by the GC) are passed to the underlying C++ code. In order to
achieve distributed network training over Spark, DL4J performs a version of the synchronous data
parallelism mechanism called parameter averaging. Instead of transferring gradients to the
master, the nodes first perform a local model update and then transfer the resulting weights to
the master, where they are averaged. In contrast to generic parameter averaging, in DL4J the
Spark driver and reduction operations take the place of the parameter server (see Figure 1).

Fig. 1. Parameter averaging in DL4J over Spark (example using two mini-batches).

There are several parameters that must be adjusted to optimize training time. These include, but
are not limited to, mini-batch-split size, averaging frequency (averaging too often may imply too
much networking overhead), prefetching (how many mini-batch-splits a worker must prefetch to
avoid waiting for the data to be loaded), and repartitioning strategy (when and how to
repartition data to keep the partitions balanced).
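
To make the role of the averaging frequency concrete, the following sketch simulates parameter
averaging on a toy problem. It is an illustration under stated assumptions, not the DL4J
implementation: the scalar model, the squared-error loss, the data and the function names are all
hypothetical. Each worker starts from the broadcast weight, performs one local update per
mini-batch-split it holds, and the driver averages the resulting weights.

```python
def local_update(w, split, learning_rate):
    """One local SGD step on a worker for the loss 0.5 * (w * x - y) ** 2."""
    gradient = sum((w * x - y) * x for x, y in split) / len(split)
    return w - learning_rate * gradient


def parameter_averaging_round(w, splits_per_worker, learning_rate):
    """Run one averaging round.

    splits_per_worker[i] is the list of mini-batch-splits processed by worker i
    before averaging; its length plays the role of the averaging frequency.
    """
    worker_weights = []
    for worker_splits in splits_per_worker:
        w_local = w  # every worker starts from the same broadcast weights
        for split in worker_splits:
            w_local = local_update(w_local, split, learning_rate)
        worker_weights.append(w_local)
    return sum(worker_weights) / len(worker_weights)  # averaging on the driver


# Hypothetical data: four mini-batch-splits of two samples each (y = 2x).
s1, s2 = [(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]
s3, s4 = [(1.5, 3.0), (2.5, 5.0)], [(0.5, 1.0), (3.5, 7.0)]

# Averaging after every split: equivalent to one step on the merged mini-batch.
w_avg_every_split = parameter_averaging_round(0.0, [[s1], [s2]], 0.01)
w_single_node = local_update(0.0, s1 + s2, 0.01)

# Averaging after two splits per worker: fewer synchronizations per sample.
w_avg_every_two = parameter_averaging_round(0.0, [[s1, s3], [s2, s4]], 0.01)
print(w_avg_every_split, w_single_node, w_avg_every_two)
```

With one split per worker, the averaged weights coincide with a single-node step on the merged
mini-batch. With several local steps between averages the two are no longer guaranteed to
coincide, which is the trade-off behind the averaging frequency: less communication in exchange
for a departure from the strictly equivalent sequential update.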


5. Apache Spark 


As mentioned before, Apache Spark is the key component of the proposed framework. Spark is a
distributed system for processing data-intensive workloads. It excels at efficient memory usage,
outperforming Hadoop for many applications (Zaharia et al. (2012)). Spark is being used to
execute big data workloads on the MareNostrum supercomputer, isolating the applications from the
particularities of this HPC infrastructure. Spark is designed to avoid the file system as much as
possible, keeping most data resident in distributed memory across phases of the same job. This
memory-resident design benefits many applications, such as machine learning or clustering, that
require extensive reuse of results across multiple iterations. Essentially, Spark is an
implementation of the so-called Resilient Distributed Dataset (RDD) abstraction (Zaharia et al.
(2012)), which hides the details of distribution and fault tolerance for large collections of
items. The usage of Spark over alternatives with potentially better performance, e.g. MPI, was a
prerequisite of the described deployment. While Spark has some advantages such as fault
tolerance, its main advantage over other alternatives in this case was that it minimized the
deployment cost.


6. The Spark4MN Framework 


The MareNostrum supercomputer is accessed through an IBM Platform LSF workload manager. In order
to deploy Spark clusters over MareNostrum, we employ an intermediate layer called Spark4MN (Tous
et al. (2015)). Spark4MN is also in charge of managing the deployment of any additional resource
Spark needs, such as a service-based distributed file system (DFS) like HDFS. Essentially,
Spark4MN is a collection of bash scripts that deploy the Spark cluster’s services and execute the
user applications. Spark4MN scripts read a configuration file, describing the application and the
Spark cluster configuration, and submit one or more jobs to the MareNostrum workload manager.
Once the cluster’s job scheduler chooses a Spark4MN job to be executed, a number of cluster nodes
are reserved exclusively for the Spark cluster and (if requested) for the DFS (e.g. HDFS) cluster
(these may be the same nodes, depending on the configuration). After the resource allocation
procedure, Spark4MN starts the different services. In Spark4MN, the Spark master corresponds to
the standalone Spark manager, and workers are Spark worker services, where the Spark executors
are received and launched. The cluster startup requires about 12 seconds, independently of the
size of the cluster (the number of nodes). Each application is executed via spark-submit calls.
During each Spark job execution, intermediate data is produced, e.g., due to shuffling. Such data
is stored on the local disks and not on the DFS by default (as in Michael et al. (2014), this
yields the best performance). Finally, Spark timeouts are automatically configured to the maximum
duration of the job, as set by the user.


7. MareNostrum Supercomputer


MareNostrum is the Spanish Tier-0 supercomputer provided by BSC. It is an IBM System X iDataplex
based on Intel Sandy Bridge EP processors at 2.6 GHz (two 8-core Intel Xeon E5-2670 processors
per machine), with 2 GB/core (32 GB/node) and around 500 GB of local disk (IBM 500 GB 7.2K 6Gbps
NL SATA 3.5). Currently the supercomputer consists of 48896 Intel Sandy Bridge cores in 3056 JS21
nodes, with more than 104.6 TB of main memory and 2 PB of GPFS (General Parallel File System)
disk storage. More specifically, GPFS provides 1.9 PB for user data storage, 33.5 TB for metadata
storage (inodes and internal filesystem data) and a total aggregated performance of 15 GB/s. The
GPFS filesystems are configured and optimized to be mounted on 3000 nodes. All compute nodes are
interconnected through an InfiniBand FDR10 network with a non-blocking fat-tree topology. In
addition to the 40 Gb/s InfiniBand, 1 Gb/s full-duplex Ethernet is in place. With the last
upgrade, MareNostrum has a peak performance of 1.1 Petaflops.


8. Experiments and Results 


The main goal of the experiments is to evaluate the scalability properties of the proposed
deployment. To this end, we have experimented with different workloads and different deployment
setups. Regarding the benchmarking workloads, we have chosen two widely used convolutional
networks, AlexNet (Krizhevsky et al. (2017)) and GoogLeNet (Szegedy et al. (2015)). Both networks
have been used in other state-of-the-art works (e.g. Keuper and Pfreundt (2016)), which lets us
compare our results with others. While AlexNet is a rather shallow network with many parameters,
GoogLeNet is a very deep network with many convolutional layers. We apply both networks to the
dataset of the ImageNet (Russakovsky et al. (2015)) visual recognition challenge. For
reproducibility, we stick to the ILSVRC2012 classification task training and test datasets and
their standard evaluation procedure.


Table 1. Properties of the deep neural networks used in the experiments.

                                  AlexNet    GoogLeNet
Default batch size                256        32
Default step-size                 0.1        0.1
# Iterations till convergence     450k       1000k
# Layers                          25         159
# Convolutional layers            5          59
# Fully-connected (FC) layers     3          1
# Weights in FC layers            55M        1M


Regarding the deployment setup, we have tested different values for the number of nodes, the
number of Spark workers per node, the Spark data partition size, the DL4J averaging frequency and
the persistence level. Figure 2 shows the speedup results obtained with B = 256 and B = 1024 (two
Spark workers per node, averaging every 3 mini-batch-splits, Spark’s persistence level set to
MEMORY_AND_DISK_SER and automatic partitioning). The step sizes were increased according to the
batch size as suggested by Iandola et al. (2016), while the number of iterations was decreased by
the same factor. For each number of nodes n, each node processes mini-batch-splits of size
b = B/n.
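
The relation b = B/n and the step-size adjustment can be tabulated with a short script. This is a
sketch under assumptions: the list of node counts is illustrative rather than the set actually
tested, and the scaling rule is taken to be linear (step size multiplied, and iteration count
divided, by the ratio between B and the default batch size), which is one reading of the
description above. The default values come from Table 1; the function names are hypothetical.

```python
def split_size(B, n):
    """Mini-batch-split size b = B / n processed by each of the n nodes."""
    if B % n != 0:
        raise ValueError("B must be divisible by n")
    return B // n


def scaled_schedule(default_batch, default_step, default_iterations, B):
    """Assumed linear rule: scale the step size by k = B / default_batch
    and divide the number of iterations by the same factor k."""
    k = B / default_batch
    return default_step * k, round(default_iterations / k)


# AlexNet defaults from Table 1: batch size 256, step size 0.1, 450k iterations.
step, iterations = scaled_schedule(256, 0.1, 450_000, B=1024)
print("AlexNet, B = 1024: step size", step, "iterations", iterations)

# Illustrative node counts (an assumption, not the tested configurations).
for B in (256, 1024):
    for n in (1, 2, 4, 8, 16, 32, 64):
        print("B =", B, "n =", n, "b =", split_size(B, n))
```

Under these assumptions, b falls below 32 samples once n exceeds 8 for B = 256 and once n exceeds
32 for B = 1024, which is why the larger mini-batch leaves more room for adding nodes.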

Under a basic setup (averaging after each computed mini-batch and uniform node workload),
synchronous data parallelism through parameter averaging is mathematically equivalent to a
non-parallel computation and yields the same accuracy results. However, accuracy degrades
(regardless of parallelization) when mini-batch sizes become too large (Keskar et al. (2016)),
which imposes a constraint on scalability. Table 2 shows the accuracy results for the different
configurations.


Table 2. Impact on accuracy of the different configurations.

Network (mini-batch size)    Accuracy
AlexNet (B = 256)            56.9%
AlexNet (B = 1024)           53.6%
GoogLeNet (B = 256)          67.1%
GoogLeNet (B = 1024)         65.4%

Fig. 2. Speedup results for AlexNet (Krizhevsky et al. (2017)) and GoogLeNet (Szegedy et al.
(2015)) with different mini-batch sizes B.


The results of our evaluation show that DL4J and Spark are able to scale deep learning workloads
over MareNostrum. However, effective scaling stops above 32 nodes with the best configurations.
This limitation agrees with the results reported in Keuper and Pfreundt (2016), which studies the
theoretical constraints of synchronous data parallelism for DNN training. The main bottleneck of
the synchronous approach is the computation-to-communication ratio. The synchronous
parallelization of DNN training requires the communication of the model $w_t$ and the computed
gradients $\Delta w_t$ between all nodes in every iteration $t$. Since $w_t$ has to be
synchronized across all nodes and $\Delta w_{t+1}$ cannot be computed before $w_t$ is available,
the entire communication has to be completed before the next iteration $t+1$. The problem is that
$w$ and $\Delta w$ have the size of all weights in the neural network, which can be hundreds of
megabytes. The compute times per iteration are rather low and decrease when scaling to more
nodes. Depending on the model size and layout, the training problem becomes communication bound
after scaling to only a few nodes. Shallow networks with many neurons per layer (like AlexNet)
scale worse than deep networks with fewer neurons (like GoogLeNet), where longer compute times
meet smaller model sizes.
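
The computation-to-communication argument can be illustrated with a deliberately simple cost
model. Everything in this sketch is an assumption made for illustration: the model itself
(compute time divided by n plus a fixed per-iteration cost for sending the weights out and the
results back), the single-node compute time of 2 s, the 4-byte weights and the 5 GB/s effective
bandwidth. Only the 55M weights figure comes from Table 1. It is not the analysis of Keuper and
Pfreundt (2016) and it does not reproduce the measured speedups.

```python
def iteration_time(n, compute_time_one_node, model_bytes, bandwidth_bytes_per_s):
    """Time of one synchronous iteration on n nodes under the toy model."""
    compute = compute_time_one_node / n  # compute shrinks as nodes are added
    # Weights are sent out and results sent back once per iteration (n > 1).
    communicate = 0.0 if n == 1 else 2.0 * model_bytes / bandwidth_bytes_per_s
    return compute + communicate


def speedup(n, compute_time_one_node, model_bytes, bandwidth_bytes_per_s):
    """Speedup of n nodes with respect to a single node."""
    t1 = iteration_time(1, compute_time_one_node, model_bytes, bandwidth_bytes_per_s)
    tn = iteration_time(n, compute_time_one_node, model_bytes, bandwidth_bytes_per_s)
    return t1 / tn


MODEL_BYTES = 55_000_000 * 4   # 55M FC weights (Table 1), assumed 4 bytes each
BANDWIDTH = 5e9                # assumed effective bandwidth in bytes per second
COMPUTE_ONE_NODE = 2.0         # assumed seconds per iteration on one node

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(n, round(speedup(n, COMPUTE_ONE_NODE, MODEL_BYTES, BANDWIDTH), 2))

# Under this model the speedup can never exceed compute / communication time.
upper_bound = COMPUTE_ONE_NODE / (2.0 * MODEL_BYTES / BANDWIDTH)
```

In this model a larger set of weights or a shorter compute time lowers the bound, which matches
the qualitative observation that AlexNet-like networks scale worse than GoogLeNet-like ones.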


A second problem of the synchronous approach is that nodes process mini-batch-splits instead of
mini-batches, and the size b of these splits depends on the number of nodes n. If b is too small
(less than 32 samples in our experiments), there is a negative impact on the inner parallel
computation (within the node), especially in the case of the fully-connected (FC) layers. One
solution would be to increase the mini-batch size in proportion to the number of nodes, but
overly large batch sizes have been shown to slow down convergence and degrade the generalization
properties of the trained model (Kurth et al. (2017)). As larger mini-batch sizes enable lower
absolute training times when many nodes are used, they are preferred up to the point where
accuracy starts to degrade. While the intra-node parallelism of fully-connected layers improves
with a larger mini-batch size, the poor scalability results observed at low core counts can be
attributed to the poor intra-node parallelism of the other layers (dropout, pooling and LRN). The
observed behavior is consistent with the results from Keuper and Pfreundt (2016).

Another aspect negatively impacting scalability at low core counts (few nodes) may be related to
data loading. Our implementation uses asynchronous data prefetching (the default DL4J behavior).
The next mini-batch-splits are loaded in another thread of the worker while training proceeds in
the main thread. Under ideal circumstances (fast disk access and small mini-batch-split size),
asynchronous prefetching implies negligible data loading delays (except on the first iteration).
However, MareNostrum nodes are equipped with relatively slow local disks (IBM 500 GB 7.2K 6Gbps
NL SATA 3.5), a circumstance that can turn data loading into a bottleneck when mini-batch-splits
are too big (i.e. when large mini-batches are distributed among few nodes) and the network is
shallow.

A third problem is stragglers. The duration of the iteration 
depends on the slowest node. This effect gets worse with scale. 
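
The straggler effect can be illustrated with a small simulation. The per-node compute times are
drawn from a normal distribution with mean 1.0 and standard deviation 0.1; this distribution, the
seed and the function names are assumptions for illustration only and are not measurements from
MareNostrum. Because a synchronous iteration lasts as long as its slowest node, the average
iteration time grows with the number of nodes even though each node has the same expected speed.

```python
import random


def iteration_duration(node_times):
    """A synchronous iteration finishes when the slowest node finishes."""
    return max(node_times)


def mean_iteration_duration(n, trials, seed):
    """Average duration of a synchronous iteration over simulated trials."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    total = 0.0
    for _ in range(trials):
        node_times = [rng.gauss(1.0, 0.1) for _ in range(n)]
        total += iteration_duration(node_times)
    return total / trials


for n in (1, 4, 16, 64):
    print(n, round(mean_iteration_duration(n, trials=2000, seed=0), 3))
```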

Asynchronous parallelization, not possible with the current version of DL4J, would solve these
problems but, as mentioned before, suffers from the stale gradient problem (though our nodes are
homogeneous and the impact would be low). Some recent works, like Kurth et al. (2017), propose a
hybrid approach in which synchronous parallelism only takes place within groups of nodes. Partial
solutions to the stale gradient problem (e.g. Nguyen et al. (2018)) have also been proposed.


9. Conclusions 


The research work presented in this paper explores the feasibility and efficiency of using
Apache Spark and DL4J for deploying deep learning workloads over a real-world, petascale HPC
setup such as MareNostrum. To this end, we have designed a layered architecture consisting of
both general-purpose (Spark and DL4J) and custom (Spark4MN) components. We have evaluated the
deployment by training AlexNet and GoogLeNet over the ImageNet dataset. We have tested different
deployment setups (number of nodes, number of Spark workers per node, data partition size,
mini-batch size, mini-batch-split size, averaging frequency, prefetching and repartitioning
strategy).

We conclude that it is feasible to rely on Apache Spark to deploy deep learning workloads over a
traditional HPC setup. This approach minimizes deployment costs and enables a systematic tuning
of the different configuration parameters, both at application level and at infrastructure level.
However, effective scaling is strongly limited by the synchronous parallelism approach applied by
the latest DL4J version. Problems such as communication overhead, mini-batch-split size and
stragglers degrade scalability beyond 32 nodes. In order to overcome this limitation, it would be
necessary to replace the synchronous mechanism by a hybrid approach in which synchronization only
takes place within fixed-size node sets. Certain aspects, such as a quantitative evaluation of
the effects of asynchronous data prefetching on performance, deserve further investigation and
will be addressed in future work.